This report summarizes work performed under NASA Grant NA61-46 during 
the period September 1, 1980 to February 28, 1981 and constitutes the second 
semi-annual report. This work has been done under the technical monitorship 
of David Loendorf of Langley Research Center. 

The primary accomplishment during this period has been the development 
of a multi-colored SOR program for the Finite Element Machine. The
multi-colored SOR method, described in detail in the last semi-annual report, uses 
a generalization of the classical Red/Black grid point ordering for the SOR 
method. These "multi-colored" orderings have the advantage of allowing the 
SOR method to be implemented as a Jacobi method, which is ideal for arrays 
of processors, but still enjoy the greater rate of convergence of the SOR 
method. 

The multi-colored SOR program was written in PASCAL and has run
successfully on the current four-processor version of the Finite Element
Machine. The program has been written for a general n x n array of
processors, however, and should be easily convertible to the 36-processor
array when that is operational.

The current program was written to solve a general second order 
self-adjoint elliptic problem on a square region with Dirichlet 
boundary conditions, discretized by quadratic elements on triangular 
regions. This problem is described in more detail in the Appendix to 
this report. We note that for this general problem and discretization, 
six colors are necessary (and sufficient) for the multi-colored method 
to operate efficiently. The specific problem that was solved using the 
six-color program was Poisson's equation; for Poisson's equation, three 
colors are necessary and sufficient but six may be used. 

In general, the number of colors needed will be a function of the 
differential equation, the region and boundary conditions, and the 
particular finite elements used for the discretization. One of the 
problems currently being investigated is how to determine in a systematic 
way the proper number of colors and the grouping of the unknowns into 
the processors. Proper processor assignment is, of course, crucial for 
the success of the method. In the Appendix a rough analysis is given of 
some of the questions that arise in processor assignment. The six and 
three color orderings may be found in Figures 1 and 2 respectively. 

The program has thus far been run on two test problems: 


Problem 1:

    (differential equation illegible in the source)   in the unit square

    u = x^2 + y^2   on the boundary

Problem 2:

    u_{xx} + u_{yy} = 1   in the unit square

    u = 0   on the boundary


Problem 1 is a standard mathematical test problem. Problem 2 is a model for 
a fixed boundary membrane and was suggested by the project monitor, David 
Loendorf. The results on both problems have been very encouraging. 

In addition to the program development, a number of questions relating 
to the efficiency of the multi-color method have been investigated. 

1. Will the iterates produced by the multi-color SOR method converge? 

2. What is the rate of convergence of the method, especially as a function 
of the overrelaxation factor? 

3. How does one choose an optimum (or good) overrelaxation factor? 

The answer to the first question is quite easy if the coefficient matrix 
of the system is positive definite. In this case, the multi-color ordering 
is just a permutation transformation of this matrix, and the permuted matrix 
is also positive definite and SOR will converge. 
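
The argument above can be checked numerically on a small example. The following is a minimal sketch, not the report's program: the test matrix (a 1-D Laplacian), the permutation standing in for a multi-color ordering, and the function names are all illustrative assumptions.

```python
import numpy as np

def sor_iteration_matrix(a, omega):
    """SOR iteration matrix (D + omega*L)^-1 ((1 - omega)*D - omega*U)."""
    d = np.diag(np.diag(a))
    lower = np.tril(a, k=-1)
    upper = np.triu(a, k=1)
    return np.linalg.solve(d + omega * lower, (1.0 - omega) * d - omega * upper)

def spectral_radius(m):
    """Largest eigenvalue modulus of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(m))))

# Small symmetric positive definite test matrix (illustrative only).
n = 8
a = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

# An arbitrary reordering of the unknowns, standing in for a color ordering.
perm = np.array([0, 3, 6, 1, 4, 7, 2, 5])
a_perm = a[np.ix_(perm, perm)]            # the permuted matrix P A P^T

# The permutation leaves the eigenvalues unchanged, so a_perm is still
# positive definite.
eig_a = np.sort(np.linalg.eigvalsh(a))
eig_p = np.sort(np.linalg.eigvalsh(a_perm))

# SOR on the permuted matrix: spectral radius for several relaxation factors.
radii = [spectral_radius(sor_iteration_matrix(a_perm, w))
         for w in (0.5, 1.0, 1.5, 1.9)]
```

Every entry of `radii` should be below one, which is the convergence statement for relaxation factors between 0 and 2.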

The other two questions are more difficult. The coefficient matrix does 
not have Property A, the classical condition of Young which allows a
correspondence between the eigenvalues of the SOR iteration matrix and the
Jacobi iteration matrix. Nevertheless, the computer results so far indicate
that the rate of convergence of the multi-color SOR method applied to either
Problem 1 or 2 above behaves, as a function of the relaxation factor, very
much like the SOR method applied to the classical 5-point finite difference
discretization of Laplace's equation. These experimental results are very
encouraging, since SOR on the 5-point discretization is essentially a best
case situation. Work is continuing on attempting to understand why these
experimental results hold.
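
For reference, the classical 5-point behavior that the experiments are compared against can be written down from Young's theory. The sketch below assumes the 5-point discretization of Laplace's equation on the unit square with mesh width h, for which the Jacobi spectral radius is cos(pi h); the function names are illustrative, and nothing here is claimed about the multi-color matrix itself.

```python
import math

def jacobi_radius_5pt(h):
    """Jacobi spectral radius for the 5-point Laplacian on the unit square."""
    return math.cos(math.pi * h)

def optimal_omega(rho_jacobi):
    """Young's optimal relaxation factor for a consistently ordered matrix."""
    return 2.0 / (1.0 + math.sqrt(1.0 - rho_jacobi ** 2))

def sor_radius(omega, rho_jacobi):
    """Young's formula for the SOR spectral radius as a function of omega."""
    if omega >= optimal_omega(rho_jacobi):
        return omega - 1.0
    t = omega * rho_jacobi
    root = (t + math.sqrt(t * t - 4.0 * (omega - 1.0))) / 2.0
    return root ** 2

h = 1.0 / 16.0
rho = jacobi_radius_5pt(h)
omega_opt = optimal_omega(rho)
rate_gauss_seidel = sor_radius(1.0, rho)   # equals rho squared
rate_optimal = sor_radius(omega_opt, rho)  # equals omega_opt - 1
```

The rate falls steadily as omega rises from 1 to the optimum and then grows linearly as omega - 1, which is the shape the experimental curves are reported to resemble.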

It is anticipated that in the near future, additional more demanding
problems will be attempted on the Finite Element Machine; in particular,
Ms. Adams is planning to spend most of the summer of 1981 at Langley
Research Center. We also plan to continue our investigation of certain
variants of the conjugate gradient method, as an alternative to the
multi-colored SOR method.



APPENDIX 


The general problem considered is the second order self-adjoint 
elliptic problem 

    Lu(x,y) = -f(x,y)   in \Omega, the unit square

    u(x,y) = g(x,y)     on \partial\Omega

where

    Lu = au - (b u_x)_x - (c u_y)_y

and a, b, c are functions of x and y.


MATHEMATICAL SOLUTION (Quadratic Elements on Triangular Regions)

Multiplying by a test function v and integrating over the unit square
\Omega gives

    \int_\Omega (au - (b u_x)_x - (c u_y)_y) v = \int_\Omega -f v

Using the divergence theorem, the left side becomes

    \int_\Omega (a u v + b u_x v_x + c u_y v_y)
        - \int_{\partial\Omega} (b u_x n_1 + c u_y n_2) v

where n = n_1 e_1 + n_2 e_2 is the normal to \partial\Omega.

Note that for Dirichlet boundary conditions the boundary term involving
b u_x n_1 + c u_y n_2 drops out, since the test functions v vanish on
\partial\Omega. Therefore, the equation to be solved is

    \int_\Omega (a u v + b u_x v_x + c u_y v_y) = \int_\Omega -f v

Now, to solve Poisson's equation, u_{xx} + u_{yy} = f, set a = 0, b = c = 1
to get

    \int_\Omega (u_x v_x + u_y v_y) = \int_\Omega -f v                  (1)


Discretize the unit square with M points in each row and column, including
the boundary points. Let

    u = \sum_{ij} u_{ij} \phi_{ij} + \sum_{bp} u_{bp} \phi_{bp},    v = \phi_{kl},

where the \phi's are the basis functions for Galerkin's Finite Element
Method and the \phi_{bp}'s represent those basis functions associated with
boundary points.

With these substitutions and definitions, (1) now becomes the following
(2n-1)x(2n-1) system of linear equations for the solution of the interior
points (values of u at discrete points H/2 units apart in both the x and
y directions):

    \sum_{ij} u_{ij} \int_\Omega (\phi_{ij,x} \phi_{kl,x} + \phi_{ij,y} \phi_{kl,y})
        = - \int_\Omega f \phi_{kl}
          - \sum_{bp} u_{bp} \int_\Omega (\phi_{bp,x} \phi_{kl,x} + \phi_{bp,y} \phi_{kl,y})

for each interior point kl, where a subscript ,x or ,y denotes a partial
derivative.


By using the six basis functions for quadratic elements and performing
the integration over a standard triangle, the integrals of the coefficient
matrix can be evaluated. Likewise, by using a numerical technique, the
integrals on the right side can be evaluated.
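
As an illustration of the integration over a standard triangle, the sketch below evaluates the 6x6 matrix of integrals of grad(phi_i) . grad(phi_j) for the Poisson case (a = 0, b = c = 1). The choice of reference triangle, the node numbering, and the edge-midpoint quadrature rule are assumptions made for this example; the report does not state which were used.

```python
import numpy as np

# Gradients of the barycentric coordinates on the reference triangle with
# vertices (0,0), (1,0), (0,1): l0 = 1 - x - y, l1 = x, l2 = y.
GRAD_L = np.array([[-1.0, -1.0], [1.0, 0.0], [0.0, 1.0]])
EDGES = [(0, 1), (1, 2), (2, 0)]   # edge-midpoint nodes are numbered 3, 4, 5

def basis_gradients(lam):
    """Gradients of the six quadratic basis functions at barycentric point lam."""
    grads = np.zeros((6, 2))
    for i in range(3):                      # vertex functions l_i (2 l_i - 1)
        grads[i] = (4.0 * lam[i] - 1.0) * GRAD_L[i]
    for k, (a, b) in enumerate(EDGES):      # midpoint functions 4 l_a l_b
        grads[3 + k] = 4.0 * (lam[b] * GRAD_L[a] + lam[a] * GRAD_L[b])
    return grads

def reference_stiffness():
    """Integrals of grad(phi_i) . grad(phi_j) over the reference triangle."""
    # Edge-midpoint rule: exact for quadratic integrands, weights = area/3.
    quad_points = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])
    k = np.zeros((6, 6))
    for lam in quad_points:
        g = basis_gradients(lam)
        k += (1.0 / 6.0) * (g @ g.T)
    return k

K_REF = reference_stiffness()
```

Because the gradients of quadratic basis functions are linear, the integrands are quadratic and the three-point rule integrates them exactly; each row of `K_REF` sums to zero because the six basis functions sum to one.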


CONNECTIVITY


The nature of the basis functions (\phi's) will cause the coefficient
matrix to be sparse. In particular, \phi_{ij} can be nonzero only on
triangles the point ij is on. This implies that \phi_{ij} is nonzero on
either 6 or 2 triangles. This connectivity is illustrated below.

[Figure: connectivity of the basis functions on the triangulation; not
recoverable from the source.]

Consider a point A whose basis function \phi_A is nonzero only on the
triangles T_i that contain A. So only the integrals involving \phi_A
together with \phi_1, ..., \phi_{18} need to be evaluated. Thus the iterative
equation for the next value of U_A will only contain U_1, U_2, ..., and
U_{18}. From the connectivity of the other basis functions shown, a similar
analysis can be made for the solutions at the other points.

COLORING SCHEME

In order to solve for any of the points, say for example point A, the values
of all the points that point A is connected to must be known. These points
must either have been calculated on a previous iteration or must have been
calculated already on the current iteration.

If processors are to be working in parallel, with each processor calculating
values of a given number of these points, it is desired that the calculation
be ordered so that each processor will not have to wait for values from other
processors before computation can begin (no starvation, in a sense). This
can be accomplished by assigning to each point a color in such a way that the
point will be a different color from all other points it is connected to. To
illustrate this, consider the simple connectivity of the 5-star:

          .
       .  .  .
          .

Clearly two colors, red and black, will produce the desired result. This is
illustrated below:


    .R   .B   .R   .B

    .B   .R   .B   .R

    .R   .B   .R   .B

    .B   .R   .B   .R



Note that each R point only needs values from B points, and each B point
only needs values from R points. Hence, all R points are calculated first,
using B values that were available from the last iteration, and then all B
points are calculated by using the R points that were calculated this
iteration. The effect is that two Jacobi sweeps constitute one Gauss-Seidel
sweep.
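
The red/black idea can be sketched in a few lines. The example below is an illustration only: it uses the 5-point finite difference stencil (not the quadratic elements of the actual program), applies it to Problem 2, and the grid size, relaxation factor, and sweep count are arbitrary choices.

```python
import numpy as np

def red_black_sor_sweep(u, f, h, omega=1.0):
    """One red/black sweep for the 5-point discretization of u_xx + u_yy = f.

    Each color is updated in one Jacobi-style step (all points of that color
    at once); with omega = 1 the two half-sweeps together form one
    Gauss-Seidel sweep in the red/black ordering.
    """
    i, j = np.meshgrid(np.arange(u.shape[0]), np.arange(u.shape[1]),
                       indexing="ij")
    interior = np.zeros(u.shape, dtype=bool)
    interior[1:-1, 1:-1] = True
    for parity in (0, 1):                    # 0 = red, 1 = black
        mask = interior & ((i + j) % 2 == parity)
        neighbors = np.zeros_like(u)
        neighbors[1:-1, 1:-1] = (u[:-2, 1:-1] + u[2:, 1:-1]
                                 + u[1:-1, :-2] + u[1:-1, 2:])
        gauss_seidel = 0.25 * (neighbors - h * h * f)
        u[mask] = (1.0 - omega) * u[mask] + omega * gauss_seidel[mask]
    return u

# Problem 2: u_xx + u_yy = 1 in the unit square, u = 0 on the boundary.
n = 17                                       # points per row, incl. boundary
h = 1.0 / (n - 1)
u = np.zeros((n, n))
f = np.ones((n, n))
for _ in range(200):
    red_black_sor_sweep(u, f, h, omega=1.5)
```

Within one color every update reads only values of the other color, so the whole color can be computed simultaneously, which is what makes the ordering attractive for an array of processors.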

The same basic idea of coloring the points can be applied to the 
connectivity described earlier by using six colors. This is illustrated 
below and in Figure 1. Note from the figure that all the hypotenuse points 
can be the same color, R, because they are not connected. Likewise, all the
vertical side midpoints can be the same color, G, and all the horizontal side
midpoints can be the same color, B. Three extra colors are needed to color
the vertices.



[Illustration: six-color ordering of the points; see Figure 1.]
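
The six-color rule can be checked mechanically on a small mesh. The sketch below is hypothetical in its details: it assumes each square is cut by the diagonal from (i,j) to (i+1,j+1), treats two points as connected when they lie on a common triangle, and colors the vertices by (i + j) mod 3, which is one possible choice of the three extra colors and not necessarily the one in Figure 1.

```python
from itertools import combinations

def triangles(n):
    """Triangles of an n-by-n grid of squares cut by the (i,j)-(i+1,j+1) diagonal.

    Vertices use doubled coordinates so that edge midpoints are integers.
    """
    for i in range(n):
        for j in range(n):
            a, b = (2 * i, 2 * j), (2 * i + 2, 2 * j)
            c, d = (2 * i + 2, 2 * j + 2), (2 * i, 2 * j + 2)
            yield (a, b, c)
            yield (a, c, d)

def nodes_of(tri):
    """The six quadratic-element nodes: three vertices, three edge midpoints."""
    mids = [((p[0] + q[0]) // 2, (p[1] + q[1]) // 2)
            for p, q in combinations(tri, 2)]
    return list(tri) + mids

def color(node):
    """Color of a node given in doubled coordinates."""
    p, q = node
    if p % 2 and q % 2:
        return "R"                          # hypotenuse midpoint
    if q % 2:
        return "G"                          # vertical side midpoint
    if p % 2:
        return "B"                          # horizontal side midpoint
    return "V%d" % ((p // 2 + q // 2) % 3)  # vertex: one of three extra colors

def build_neighbors(n):
    """Map each node to the set of nodes sharing a triangle with it."""
    neighbors = {}
    for tri in triangles(n):
        pts = nodes_of(tri)
        for u in pts:
            neighbors.setdefault(u, set()).update(v for v in pts if v != u)
    return neighbors

NEIGHBORS = build_neighbors(4)
PROPER = all(color(u) != color(v) for u, vs in NEIGHBORS.items() for v in vs)
```

`PROPER` being true means no point shares a color with any point it is connected to, and an interior vertex such as `(4, 4)` has 18 neighbors, matching the count in the connectivity discussion.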


PROCESSOR ASSIGNMENT

One consideration in assigning points to processors is the amount of
work that will be required for each processor. Ideally, each processor should
be kept busy at all times; that is, each processor should do the same fraction
of the work. In this case, the above statements say that ideally each processor
should have the same number of points to calculate.


Another consideration is the algorithm that will be used by each processor.
If the points can be assigned in such a way that the algorithm in each processor
will be the same (SIMD), the task of programming in parallel is greatly reduced,
and each processor is guaranteed to be doing its fraction of the work.

These considerations give motivation for assigning to each processor 
the same color structure. For the simple connectivity of the 5-star and two
colors and four processors, the assignment could be as follows:


    .R   .B  |  .R   .B
    .B   .R  |  .B   .R
    ---------+---------
    .R   .B  |  .R   .B
    .B   .R  |  .B   .R


For the problem under consideration, please note from Figure 1 that
the color structure repeats every 6R by 6R square of points. Therefore, if
processors are assigned to every 6R by 6R square, each processor will have
the identical structure and an identical algorithm. This assignment is
illustrated below. Note that because of the triangularization of the square
region, an odd number of points will be in each row and column.
A few words should be said about the boundary processors. All boundary
processors have values that do not have to be calculated, values that need not
be sent to other processors, and values that need not be received from other
processors. In this sense, the boundary processors could have a different
algorithm than the other processors. However, from a programming standpoint,
it is easier to test to see if a processor is on the boundary and take
appropriate action than to write a separate algorithm for the boundary
processor. The algorithm uses a PASCAL CASE statement to determine the type of
processor. As written, the algorithm is SIMD, but the MIMD algorithm is easily
extracted from the CASE statements.
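
The "test for the boundary, then act" idea can be sketched as follows. This is a hypothetical illustration in Python rather than the PASCAL CASE statement of the actual program; the function names, the side labels, and the restriction to four neighbors are assumptions.

```python
def boundary_sides(row, col, k):
    """Sides of processor (row, col) in a k-by-k array that lie on the boundary."""
    sides = set()
    if row == 0:
        sides.add("south")
    if row == k - 1:
        sides.add("north")
    if col == 0:
        sides.add("west")
    if col == k - 1:
        sides.add("east")
    return sides

def exchange_plan(row, col, k):
    """For each side, 'exchange' with a neighbor or use 'fixed' boundary data."""
    fixed = boundary_sides(row, col, k)
    return {side: ("fixed" if side in fixed else "exchange")
            for side in ("north", "south", "east", "west")}

corner_plan = exchange_plan(0, 0, 2)
interior_plan = exchange_plan(1, 1, 3)
```

Every processor runs the same code; only the result of the boundary test differs, which is the sense in which a single program serves interior and boundary processors alike.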


ANALYSIS OF PROCESSOR ASSIGNMENT


First of all, since each processor is doing its fraction of the work,
the best speedup over a uniprocessor one can hope to obtain is O(NP), where
NP = (K+1)^2, the number of processors. In practice, however, this will not be
realized since speedup will be slowed down by the interprocessor communications
(that is, data transmissions). An analytical expression for T(K), the
number of local data transmissions needed when a K by K grid of 6R by 6R
processors is used, is derived below, but first some notation:


    n   : # of rectangles
    H   : width of each rectangle
    M   : # of points in each row
    R   : determines the 6R by 6R blocks
    B-A : width of original square
    K   : # of 6R by 6R processors

    M/6R = K + (6R+1)/6R   or   M = 6RK + 6R + 1        (1)
        (for this topology; see picture)

    But M = 2(B-A)/H + 1 = 2n + 1                        (2)
        (interior plus boundary points in a row)



Combine (1) and (2) to get

    R = (B-A)/(H(3K + 3)) = n/(3K + 3)

Since K is odd, 3K + 3 is even. This implies that n must be chosen to be
even and divisible by 3K + 3. At first this appears to be a restriction, but
it is not really since, if a specified n is not divisible by 3K + 3, it may
be increased until it is, thereby improving the accuracy of the solution (note
that increasing n means H is decreasing).
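
The divisibility requirement can be illustrated with a short calculation. The sketch reads the relations above as M = 6RK + 6R + 1 and M = 2n + 1, so that R = n/(3K + 3); the function names are illustrative.

```python
def block_parameter(n, k):
    """Return R = n / (3K + 3), or None if n is not divisible by 3K + 3."""
    divisor = 3 * k + 3
    return n // divisor if n % divisor == 0 else None

def next_admissible_n(n, k):
    """Smallest value >= n that is divisible by 3K + 3."""
    divisor = 3 * k + 3
    return -(-n // divisor) * divisor       # ceiling division, then scale back

def points_per_row(r, k):
    """M = 6RK + 6R + 1 points in each row, boundary included."""
    return 6 * r * k + 6 * r + 1

# Example: K = 1 and a requested n = 10, which is not divisible by 6.
k = 1
n = next_admissible_n(10, k)                # increased to 12
r = block_parameter(n, k)                   # R = 2
m = points_per_row(r, k)                    # M = 25 = 2n + 1
```

Increasing n from 10 to 12 refines the mesh slightly and makes both expressions for M agree.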

It is an easy task to add up the transmissions that occur between
processors (two-way) to get for the K by K topology the following:

    T(K) = - 6K + 12KR                                     (3)

Also, it is easy to get an expression for the number of transmission links
(lines) needed:

    L(K) = 4(2K^2 + K)

Of those links, almost [illegible], namely [illegible], will have either 1 or
2 transmissions only.

It is interesting to note that the number of transmissions for the 6R by 6R
topology is bounded above and below by 3 and 2 times the number of
transmissions for a 12R by 12R topology respectively:

    2T(K) < T(2K+1) < 3T(K)

or                                                                    (5)

    2T_{12R x 12R} < T_{6R x 6R} < 3T_{12R x 12R}

Also,

    [illegible] L(K) < L(2K+1) <= [illegible] L(K)

or                                                                    (6)

    [illegible] L_{12R x 12R} < L_{6R x 6R} <= [illegible] L_{12R x 12R}
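
The growth in the number of links when the blocks are halved can be tabulated directly. The sketch assumes the link count reads L(K) = 4(2K^2 + K); the exponent is not legible in the source, so this form is an assumption, and the ratios below hold only under it.

```python
def links(k):
    """Number of transmission links, assuming L(K) = 4(2K^2 + K)."""
    return 4 * (2 * k * k + k)

# Ratio L(2K+1) / L(K) for a few odd values of K.
ratios = {k: links(2 * k + 1) / links(k) for k in (1, 3, 5, 7, 9)}
```

Under this assumption the ratio is largest for K = 1 and decreases toward 4 as K grows.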




Equation (5) is encouraging, since reducing the blocks from 12R by 12R to
6R by 6R gives a potential speedup of 4 in computation time but only requires
between 2 and 3 times as many data transmissions. If the architecture of the
parallel computer is such that data transmission is rapid, a potential savings
should occur. Also, if the architecture permits transmissions (sends and
receives) to occur simultaneously with computation, the coloring of the points
suggests that many of these data transmissions can occur while computation is
being carried out. In other words, the amount of slowdown due to the
transmissions is very architecture dependent and should be determined
experimentally as well.

DATA STRUCTURE USED FOR EACH PROCESSOR

It was discovered that writing the algorithm became fairly easy once
the "correct" data structure was discovered. At first, it was believed that
each processor should only store its color structure. However, this idea
was quickly abandoned when it was realized that each processor must receive
values from other processors and these values were used to calculate perhaps
more than one point in the color structure. It was decided to use a 6R+3 by
6R+3 array to store the 6R by 6R color structure, one column of receives from
the west processor, two columns of receives from the east processor, one row
of receives from the south processor, and two rows of receives from the north
processor. Below is an illustration of the data structure used (R=1 in the
illustration).
Note that this data structure applies to the boundary processors as well;
boundary values can be thought of as receives that are stored only once.
Hence, the same data structure can be used in all the processors.
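
A minimal sketch of such a local array follows. The placement of the receive rows and columns (index 0 for the west column and south row, the last two indices for east and north) and all names are assumptions for illustration; the report's own illustration is not reproduced here.

```python
import numpy as np

def make_local_array(r):
    """Allocate the (6R+3) by (6R+3) array held by one processor."""
    size = 6 * r + 3
    return np.zeros((size, size))

def named_views(local, r):
    """Views into the local array: own points plus the four receive strips.

    Assumed layout: rows run south to north, columns west to east.
    """
    m = 6 * r
    return {
        "own": local[1:m + 1, 1:m + 1],          # 6R by 6R color structure
        "west": local[1:m + 1, 0:1],             # one column of receives
        "east": local[1:m + 1, m + 1:m + 3],     # two columns of receives
        "south": local[0:1, 1:m + 1],            # one row of receives
        "north": local[m + 1:m + 3, 1:m + 1],    # two rows of receives
    }

local = make_local_array(1)
views = named_views(local, 1)
views["east"][:] = 1.0                           # store received east values
```

Since the entries are NumPy views, values written into a receive strip are immediately visible to updates that index the full local array, and a boundary processor can fill the same strips once with fixed boundary values.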